Search | VHL Regional Portal

1.

Recognize the Value of the Sum Score, Psychometrics' Greatest Accomplishment.

Sijtsma, Klaas; Ellis, Jules L; Borsboom, Denny.

Psychometrika ; 89(1): 84-117, 2024 Mar.

Article in English | MEDLINE | ID: mdl-38627311

ABSTRACT

The sum score on a psychological test is, and should continue to be, a tool central in psychometric practice. This position runs counter to several psychometricians' belief that the sum score represents a pre-scientific conception that must be abandoned from psychometrics in favor of latent variables. First, we reiterate that the sum score stochastically orders the latent variable in a wide variety of much-used item response models. In fact, item response theory provides a mathematically based justification for the ordinal use of the sum score. Second, because discussions about the sum score often involve its reliability and estimation methods as well, we show that, based on very general assumptions, classical test theory provides a family of lower bounds several of which are close to the true reliability under reasonable conditions. Finally, we argue that eventually sum scores derive their value from the degree to which they enable predicting practically relevant events and behaviors. None of our discussion is meant to discredit modern measurement models; they have their own merits unattainable for classical test theory, but the latter model provides impressive contributions to psychometrics based on very few assumptions that seem to have become obscured in the past few decades. Their generality and practical usefulness add to the accomplishments of more recent approaches.

Subject(s)

Psychometrics , Psychometrics/methods , Humans , Reproducibility of Results , Models, Statistical

2.

Proof of Reliability Convergence to 1 at Rate of Spearman-Brown Formula for Random Test Forms and Irrespective of Item Pool Dimensionality.

Ellis, Jules L; Sijtsma, Klaas.

Psychometrika ; 2024 Mar 12.

Article in English | MEDLINE | ID: mdl-38472632

ABSTRACT

It is shown that the psychometric test reliability, based on any true-score model with randomly sampled items and uncorrelated errors, converges to 1 as the test length goes to infinity, with probability 1, assuming some general regularity conditions. The asymptotic rate of convergence is given by the Spearman-Brown formula, and for this it is not needed that the items are parallel, or latent unidimensional, or even finite dimensional. Simulations with the 2-parameter logistic item response theory model reveal that the reliability of short multidimensional tests can be positively biased, meaning that applying the Spearman-Brown formula in these cases would lead to overprediction of the reliability that results from lengthening a test. However, test constructors of short tests generally aim for short tests that measure just one attribute, so that the bias problem may have little practical relevance. For short unidimensional tests under the 2-parameter logistic model reliability is almost unbiased, meaning that application of the Spearman-Brown formula in these cases of greater practical utility leads to predictions that are approximately unbiased.

3.

Solving the many-variables problem in MICE with principal component regression.

Costantini, Edoardo; Lang, Kyle M; Sijtsma, Klaas; Reeskens, Tim.

Behav Res Methods ; 56(3): 1715-1737, 2024 Mar.

Article in English | MEDLINE | ID: mdl-37540467

ABSTRACT

Multiple Imputation (MI) is one of the most popular approaches to addressing missing values in questionnaires and surveys. MI with multivariate imputation by chained equations (MICE) allows flexible imputation of many types of data. In MICE, for each variable under imputation, the imputer needs to specify which variables should act as predictors in the imputation model. The selection of these predictors is a difficult, but fundamental, step in the MI procedure, especially when there are many variables in a data set. In this project, we explore the use of principal component regression (PCR) as a univariate imputation method in the MICE algorithm to automatically address the many-variables problem that arises when imputing large social science data. We compare different implementations of PCR-based MICE with a correlation-thresholding strategy through two Monte Carlo simulation studies and a case study. We find the use of PCR on a variable-by-variable basis to perform best and that it can perform closely to expertly designed imputation procedures.

Subject(s)

Algorithms , Humans , Computer Simulation , Surveys and Questionnaires , Monte Carlo Method

4.

Is it fun or is it hard? Studying physician-related attributes of shared decision-making by ranking case vignettes.

Spinnewijn, Laura; Aarts, Johanna; Braat, Didi; Baranov, Nikolaj; Sijtsma, Klaas; Ellis, Jules; Scheele, Fedde.

PEC Innov ; 3: 100208, 2023 Dec 15.

Article in English | MEDLINE | ID: mdl-37727700

ABSTRACT

Objective: This study investigated provider-related attributes of shared decision-making (SDM). It studied how physicians rank SDM cases compared to other cases, taking 'job satisfaction' and 'complexity' as ranking criteria. Methods: Ten vignettes representing three cases of SDM, three cases dealing with patients' emotions and four with technical problems were designed to conduct a modified ordinal preference elicitation study. Gynaecologists and trainees ranked the vignettes for 'job satisfaction' or 'complexity'. Results were analysed by comparing the top three and down three ranked cases for each type of case using exact p-values obtained with custom-made randomisation tests. Results: Participants experienced more satisfaction significantly from performing technical cases than cases dealing with emotions or SDM. Moreover, technical cases were perceived as less complex than those dealing with emotions. However, results were inconclusive about whether gynaecologists find SDM complex. Conclusion: Findings suggest gynaecologists experience lower satisfaction with SDM tasks, possibly due to them falling outside their comfort zone. Integrating SDM into daily routines and promoting culture change favouring dealing with non-technical problems might help mitigate issues in SDM implementation. Innovation: Our novel study assesses SDM in the context of task appraisal, illuminating the psychology of health professionals and providing valuable insights for implementation science.

5.

A Test to Distinguish Monotone Homogeneity from Monotone Multifactor Models.

Ellis, Jules L; Sijtsma, Klaas.

Psychometrika ; 88(2): 387-412, 2023 06.

Article in English | MEDLINE | ID: mdl-36933110

ABSTRACT

The goodness-of-fit of the unidimensional monotone latent variable model can be assessed using the empirical conditions of nonnegative correlations (Mokken in A theory and procedure of scale-analysis, Mouton, The Hague, 1971), manifest monotonicity (Junker in Ann Stat 21:1359-1378, 1993), multivariate total positivity of order 2 (Bartolucci and Forcina in Ann Stat 28:1206-1218, 2000), and nonnegative partial correlations (Ellis in Psychometrika 79:303-316, 2014). We show that multidimensional monotone factor models with independent factors also imply these empirical conditions; therefore, the conditions are insensitive to multidimensionality. Conditional association (Rosenbaum in Psychometrika 49(3):425-435, 1984) can detect multidimensionality, but tests of it (De Gooijer and Yuan in Comput Stat Data Anal 55:34-44, 2011) are usually not feasible for realistic numbers of items. The only existing feasible test procedures that can reveal multidimensionality are Rosenbaum's (Psychometrika 49(3):425-435, 1984) Case 2 and Case 5, which test the covariance of two items or two subtests conditionally on the unweighted sum of the other items. We improve this procedure by conditioning on a weighted sum of the other items. The weights are estimated in a training sample from a linear regression analysis. Simulations show that the Type I error rate is under control and that, for large samples, the power is higher if one dimension is more important than the other or if there is a third dimension. In small samples and with two equally important dimensions, using the unweighted sum yields greater power.

Subject(s)

Models, Theoretical , Psychometrics/methods , Regression Analysis , Linear Models

6.

How to combine rules and commitment in fostering research integrity?

Labib, Krishma; Tijdink, Joeri; Sijtsma, Klaas; Bouter, Lex; Evans, Natalie; Widdershoven, Guy.

Account Res ; : 1-27, 2023 Mar 20.

Article in English | MEDLINE | ID: mdl-36927256

ABSTRACT

Research integrity (RI) is crucial for trustworthy research. Rules are important in setting RI standards and improving research practice, but they can lead to increased bureaucracy; without commensurate commitment amongst researchers toward RI, they are unlikely to improve research practices. In this paper, we explore how to combine rules and commitment in fostering RI. Research institutions can govern RI using markets (using incentives), bureaucracies (using rules), and network processes (through commitment and agreements). Based on Habermas' Theory of Communicative Action, we argue that network processes, as part of the lifeworld, can legitimize systems - that is, market or bureaucratic governance modes. This can regulate and support RI practices in an efficient way. Systems can also become dominant and repress consensus processes. Fostering RI requires a balance between network, market and bureaucratic governance modes. We analyze the institutional response to a serious RI case to illustrate how network processes can be combined with bureaucratic rules. Specifically, we analyze how the Science Committee established at Tilburg University in 2012 has navigated different governance modes, resulting in a normatively grounded and efficient approach to fostering RI. Based on this case, we formulate recommendations to research institutions on how to combine rules and commitment.

7.

A tutorial on Bayesian single-test reliability analysis with JASP.

Pfadt, Julius M; Bergh, Don van den; Sijtsma, Klaas; Wagenmakers, Eric-Jan.

Behav Res Methods ; 55(3): 1069-1078, 2023 04.

Article in English | MEDLINE | ID: mdl-35581436

ABSTRACT

The current practice of reliability analysis is both uniform and troublesome: most reports consider only Cronbach's α, and almost all reports focus exclusively on a point estimate, disregarding the impact of sampling error. In an attempt to improve the status quo we have implemented Bayesian estimation routines for five popular single-test reliability coefficients in the open-source statistical software program JASP. Using JASP, researchers can easily obtain Bayesian credible intervals to indicate a range of plausible values and thereby quantify the precision of the point estimate. In addition, researchers may use the posterior distribution of the reliability coefficients to address practically relevant questions such as "What is the probability that the reliability of my test is larger than a threshold value of .80?". In this tutorial article, we outline how to conduct a Bayesian reliability analysis in JASP and correctly interpret the results. By making available a computationally complex procedure in an easy-to-use software package, we hope to motivate researchers to include uncertainty estimates whenever reporting the results of a single-test reliability analysis.

Subject(s)

Software , Humans , Bayes Theorem , Reproducibility of Results , Uncertainty

8.

[Plea for the use of objective data when assessing clinical trainees]. / Verantwoord beoordelen in de medische praktijkopleiding.

Scheele, Fedde; Sijtsma, Klaas.

Ned Tijdschr Geneeskd ; 1672023 11 28.

Article in Dutch | MEDLINE | ID: mdl-38175549

ABSTRACT

Medical education literature regarding assessment is traditionally based on psychometric insights. In psychometry, validity and reliability are essential parameters of assessment. A literature movement originating from the Maastricht school, which is extremely influential in the Netherlands and beyond, argues that the clinical context is too complex for traditional standardized assessment methods, and proposes to optimize the use of expert opinions. In this commentary we address the risk of bias and noise in human judgement. Multiple assessors in a clinic may share the same bias regarding certain trainees or specific behaviors. We make a plea for the use of objective data, like results from knowledge tests and objective structured clinical exams, when assessing clinical trainees. The combination of both objective data and an aggregation of expert opinions may be most feasible in clinical assessment, both for learning and for entrustment of professional activities.

Subject(s)

Ambulatory Care Facilities , Education, Medical , Humans , Reproducibility of Results , Educational Status , Judgment

9.

A causal theory of error scores.

van Bork, Riet; Rhemtulla, Mijke; Sijtsma, Klaas; Borsboom, Denny.

Psychol Methods ; 2022 Jul 25.

Article in English | MEDLINE | ID: mdl-35878074

ABSTRACT

In modern test theory, response variables are a function of a common latent variable that represents the measured attribute, and error variables that are unique to the response variables. While considerable thought goes into the interpretation of latent variables in these models (e.g., validity research), the interpretation of error variables is typically left implicit (e.g., describing error variables as residuals). Yet, many psychometric assumptions are essentially assumptions about error and thus being able to reason about psychometric models requires the ability to reason about errors. We propose a causal theory of error as a framework that enables researchers to reason about errors in terms of the data-generating mechanism. In this framework, the error variable reflects myriad causes that are specific to an item and, together with the latent variable, determine the scores on that item. We distinguish two types of item-specific causes: characteristic variables that differ between people (e.g., familiarity with words used in the item), and circumstance variables that vary over occasions in which the item is administered (e.g., a distracting noise). We show that different assumptions about these unique causes (a) imply different psychometric models; (b) have different implications for the chance experiment that makes these models probabilistic models; and (c) have different consequences for item bias, local homogeneity, and reliability coefficient α and the test-retest correlation. The ability to reason about the causes that produce error variance puts researchers in a better position to motivate modeling choices. (PsycInfo Database Record (c) 2022 APA, all rights reserved).

10.

Bayesian Estimation of Single-Test Reliability Coefficients.

Pfadt, Julius M; van den Bergh, Don; Sijtsma, Klaas; Moshagen, Morten; Wagenmakers, Eric-Jan.

Multivariate Behav Res ; 57(4): 620-641, 2022.

Article in English | MEDLINE | ID: mdl-33759671

ABSTRACT

Popular measures of reliability for a single-test administration include coefficient α, coefficient λ2, the greatest lower bound (glb), and coefficient ω. First, we show how these measures can be easily estimated within a Bayesian framework. Specifically, the posterior distribution for these measures can be obtained through Gibbs sampling - for coefficients α, λ2, and the glb one can sample the covariance matrix from an inverse Wishart distribution; for coefficient ω one samples the conditional posterior distributions from a single-factor CFA-model. Simulations show that - under relatively uninformative priors - the 95% Bayesian credible intervals are highly similar to the 95% frequentist bootstrap confidence intervals. In addition, the posterior distribution can be used to address practically relevant questions, such as "what is the probability that the reliability of this test is between .70 and .90?", or, "how likely is it that the reliability of this test is higher than .80?" In general, the use of a posterior distribution highlights the inherent uncertainty with respect to the estimation of reliability measures.

Subject(s)

Bayes Theorem , Probability , Reproducibility of Results , Uncertainty

11.

Advances in nonparametric item response theory for scale construction in quality-of-life research.

Sijtsma, Klaas; van der Ark, L Andries.

Qual Life Res ; 31(1): 1-9, 2022 Jan.

Article in English | MEDLINE | ID: mdl-34751897

ABSTRACT

We introduce the special section on nonparametric item response theory (IRT) in Quality of Life Research. Starting from the well-known Rasch model, we provide a brief overview of nonparametric IRT models and discuss the assumptions, the properties, and the investigation of goodness of fit. We provide references to more detailed texts to help readers getting acquainted with nonparametric IRT models. In addition, we show how the rather diverse papers in the special section fit into the nonparametric IRT framework. Finally, we illustrate the application of nonparametric IRT models using data from a questionnaire measuring activity limitations in walking. The real-data example shows the quality of the scale and its constituent items with respect to dimensionality, local independence, monotonicity, and invariant item ordering.

Subject(s)

Quality of Life , Humans , Psychometrics , Quality of Life/psychology , Surveys and Questionnaires

12.

Rejoinder: The Future of Reliability.

Sijtsma, Klaas; Pfadt, Julius M.

Psychometrika ; 86(4): 887-892, 2021 12.

Article in English | MEDLINE | ID: mdl-34533765

ABSTRACT

In this rejoinder, we examine some of the issues Peter Bentler, Eunseong Cho, and Jules Ellis raise. We suggest a methodological solid way to construct a test indicating that the importance of the particular reliability method used is minor, and we discuss future topics in reliability research.

Subject(s)

Research Design , Psychometrics , Reproducibility of Results

13.

Part II: On the Use, the Misuse, and the Very Limited Usefulness of Cronbach's Alpha: Discussing Lower Bounds and Correlated Errors.

Sijtsma, Klaas; Pfadt, Julius M.

Psychometrika ; 86(4): 843-860, 2021 12.

Article in English | MEDLINE | ID: mdl-34387809

ABSTRACT

Prior to discussing and challenging two criticisms on coefficient [Formula: see text], the well-known lower bound to test-score reliability, we discuss classical test theory and the theory of coefficient [Formula: see text]. The first criticism expressed in the psychometrics literature is that coefficient [Formula: see text] is only useful when the model of essential [Formula: see text]-equivalence is consistent with the item-score data. Because this model is highly restrictive, coefficient [Formula: see text] is smaller than test-score reliability and one should not use it. We argue that lower bounds are useful when they assess product quality features, such as a test-score's reliability. The second criticism expressed is that coefficient [Formula: see text] incorrectly ignores correlated errors. If correlated errors would enter the computation of coefficient [Formula: see text], theoretical values of coefficient [Formula: see text] could be greater than the test-score reliability. Because quality measures that are systematically too high are undesirable, critics dismiss coefficient [Formula: see text]. We argue that introducing correlated errors is inconsistent with the derivation of the lower bound theorem and that the properties of coefficient [Formula: see text] remain intact when data contain correlated errors.

Subject(s)

Psychometrics , Reproducibility of Results

14.

A Guide for Sparse PCA: Model Comparison and Applications.

Guerra-Urzola, Rosember; Van Deun, Katrijn; Vera, Juan C; Sijtsma, Klaas.

Psychometrika ; 86(4): 893-919, 2021 12.

Article in English | MEDLINE | ID: mdl-34185214

ABSTRACT

PCA is a popular tool for exploring and summarizing multivariate data, especially those consisting of many variables. PCA, however, is often not simple to interpret, as the components are a linear combination of the variables. To address this issue, numerous methods have been proposed to sparsify the nonzero coefficients in the components, including rotation-thresholding methods and, more recently, PCA methods subject to sparsity inducing penalties or constraints. Here, we offer guidelines on how to choose among the different sparse PCA methods. Current literature misses clear guidance on the properties and performance of the different sparse PCA methods, often relying on the misconception that the equivalence of the formulations for ordinary PCA also holds for sparse PCA. To guide potential users of sparse PCA methods, we first discuss several popular sparse PCA methods in terms of where the sparseness is imposed on the loadings or on the weights, assumed model, and optimization criterion used to impose sparseness. Second, using an extensive simulation study, we assess each of these methods by means of performance measures such as squared relative error, misidentification rate, and percentage of explained variance for several data generating models and conditions for the population model. Finally, two examples using empirical data are considered.

Subject(s)

Algorithms , Computer Simulation , Principal Component Analysis , Psychometrics

15.

Steps toward preregistration of research on research integrity.

Sijtsma, Klaas; Emons, Wilco H M; Steneck, Nicholas H; Bouter, Lex M.

Res Integr Peer Rev ; 6(1): 5, 2021 Mar 01.

Article in English | MEDLINE | ID: mdl-33648609

ABSTRACT

BACKGROUND: A proposal to encourage the preregistration of research on research integrity was developed and adopted as the Amsterdam Agenda at the 5th World Conference on Research Integrity (Amsterdam, 2017). This paper reports on the degree to which abstracts of the 6th World Conference in Research Integrity (Hong Kong, 2019) reported on preregistered research. METHODS: Conference registration data on participants presenting a paper or a poster at 6th WCRI were made available to the research team. Because the data set was too small for inferential statistics this report is limited to a basic description of results and some recommendations that should be considered when taking further steps to improve preregistration. RESULTS: 19% of the 308 presenters preregistered their research. Of the 56 usable cases, less than half provided information on the six key elements of the Amsterdam Agenda. Others provided information that invalidated their data, such as an uninformative URL. There was no discernable difference between qualitative and quantitative research. CONCLUSIONS: Some presenters at the WCRI have preregistered their research on research integrity, but further steps are needed to increase frequency and completeness of preregistration. One approach to increase preregistration would be to make it a requirement for research presented at the World Conferences on Research Integrity.

16.

Precision and Sample Size Requirements for Regression-Based Norming Methods for Change Scores.

Gu, Zhengguo; Emons, Wilco H M; Sijtsma, Klaas.

Assessment ; 28(2): 503-517, 2021 03.

Article in English | MEDLINE | ID: mdl-32336114

ABSTRACT

To interpret a person's change score, one typically transforms the change score into, for example, a percentile, so that one knows a person's location in a distribution of change scores. Transformed scores are referred to as norms and the construction of norms is referred to as norming. Two often-used norming methods for change scores are the regression-based change approach and the T Scores for Change method. In this article, we discuss the similarities and differences between these norming methods, and use a simulation study to systematically examine the precision of the two methods and to establish the minimum sample size requirements for satisfactory precision.

Subject(s)

Sample Size , Computer Simulation , Humans , Regression Analysis

17.

Standardized Protocol Items Recommendations for Observational Studies (SPIROS) for Observational Study Protocol Reporting Guidelines: Protocol for a Delphi Study.

Mahajan, Raman; Burza, Sakib; Bouter, Lex M; Sijtsma, Klaas; Knottnerus, André; Kleijnen, Jos; Dael, Peter Van; Zeegers, Maurice P.

JMIR Res Protoc ; 9(10): e17864, 2020 Oct 21.

Article in English | MEDLINE | ID: mdl-33084592

ABSTRACT

BACKGROUND: Approximately 90% of currently published clinical and public health research is in the form of observational studies. Having a detailed and registered study protocol prior to data collection is important in any empirical study. Without this, there is no reliable way to assess the occurrence of publication bias, outcome reporting bias, and other protocol deviations. However, there is currently no solid guidance available on the information that a protocol for an observational study should contain. OBJECTIVE: The aim of this study is to formulate the Standardized Protocol Items Recommendations for Observational Studies (SPIROS) reporting guidelines, which focus on 3 main study designs of analytical epidemiology: cohort, case-control, and cross-sectional studies. METHODS: A scoping review of published protocol papers of observational studies in epidemiology will identify candidate items for the SPIROS reporting guidelines. The list of items will be extended with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) checklist items and recommendations from the SPIROS steering committee. This long list serves as the basis for a 2-round Delphi survey among experts to obtain consensus on which items to include. Each candidate item from the long list will be rated on a 5-point Likert scale to assess relevance for inclusion in the SPIROS reporting guidelines. Following the Delphi survey, an expert-driven consensus workshop will be convened to finalize the reporting guidelines. RESULTS: A scoping review of published observational study protocols has been completed, with 59 candidate items identified for inclusion into the Delphi survey, itself launched in early 2020. CONCLUSIONS: This project aims to improve the timeliness, completeness, and clarity of study protocols of observational studies in analytical epidemiology by producing expert-based recommendations of items to be addressed. These reporting guidelines will facilitate and encourage researchers to prepare and register study protocols of sufficient quality prior to data collection in order to improve the transparency, reproducibility, and quality of observational studies. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): PRR1-10.2196/17864.

18.

Longitudinal associations between trait neuroticism and negative daily experiences in adolescence.

Borghuis, Jeroen; Bleidorn, Wiebke; Sijtsma, Klaas; Branje, Susan; Meeus, Wim H J; Denissen, Jaap J A.

J Pers Soc Psychol ; 118(2): 348-363, 2020 Feb.

Article in English | MEDLINE | ID: mdl-30676043

ABSTRACT

It is well established that trait neuroticism bears strong links with negative affect and interpersonal problems. The goal of this study was to examine the longitudinal associations between neuroticism and daily experiences of negative affect and interpersonal problems during the developmentally important period of adolescence. Dutch adolescents and their best friends (N = 1,046) completed up to 6 yearly personality trait questionnaires and up to 15 between-year assessment bursts between the ages 13 and 18. During each assessment burst, participants reported on 5 consecutive days about their experiences of negative affect and interpersonal conflict with their mother and their best friend. We estimated a series of multilevel random-intercept cross-lagged panel models to differentiate covariance at the level of constant between-person differences from dynamic processes that occurred within persons. At the level of constant between-person differences, higher neuroticism was associated with more negative daily experiences. At the within-person level, yearly changes in neuroticism were bidirectionally and positively associated with yearly changes in daily negative affect. The most parsimonious, best fitting models did not contain a random intercept for daily conflict with friend and adolescents' contingency between daily experiences of conflict with mother and negative affect. Rank-order differences in these variables were positively associated with subsequent within-person changes in neuroticism. We discuss these results with regard to endogenous versus dynamic theories of personality development and the value of using a differentiated statistical approach. (PsycINFO Database Record (c) 2020 APA, all rights reserved).

Subject(s)

Affect , Friends/psychology , Interpersonal Relations , Neuroticism , Adolescent , Family Conflict/psychology , Female , Humans , Longitudinal Studies , Male , Netherlands , Surveys and Questionnaires

19.

Using Confidence Intervals for Assessing Reliability of Real Tests.

Oosterwijk, Pieter R; van der Ark, L Andries; Sijtsma, Klaas.

Assessment ; 26(7): 1207-1216, 2019 10.

Article in English | MEDLINE | ID: mdl-29084436

ABSTRACT

Test authors report sample reliability values but rarely consider the sampling error and related confidence intervals. This study investigated the truth of this conjecture for 116 tests with 1,024 reliability estimates (105 pertaining to test batteries and 919 to tests measuring a single attribute) obtained from an online database. Based on 90% confidence intervals, approximately 20% of the initial quality assessments had to be downgraded. For 95% confidence intervals, the percentage was approximately 23%. The results demonstrated that reported reliability values cannot be trusted without considering their estimation precision.

Subject(s)

Confidence Intervals , Psychological Tests/standards , Reproducibility of Results , Belgium , Databases, Factual , Humans , Netherlands

20.

Item-Score Reliability in Empirical-Data Sets and Its Relationship With Other Item Indices.

Zijlmans, Eva A O; Tijmstra, Jesper; van der Ark, L Andries; Sijtsma, Klaas.

Educ Psychol Meas ; 78(6): 998-1020, 2018 Dec.

Article in English | MEDLINE | ID: mdl-30542214

ABSTRACT

Reliability is usually estimated for a total score, but it can also be estimated for item scores. Item-score reliability can be useful to assess the repeatability of an individual item score in a group. Three methods to estimate item-score reliability are discussed, known as method MS, method λ 6 , and method CA. The item-score reliability methods are compared with four well-known and widely accepted item indices, which are the item-rest correlation, the item-factor loading, the item scalability, and the item discrimination. Realistic values for item-score reliability in empirical-data sets are monitored to obtain an impression of the values to be expected in other empirical-data sets. The relation between the three item-score reliability methods and the four well-known item indices are investigated. Tentatively, a minimum value for the item-score reliability methods to be used in item analysis is recommended.

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL